Red Wine Exploration by Paulo Casaretto

Univariate Plots Section

## [1] 1599   14
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "rating"
## 'data.frame':    1599 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
##  $ rating              : Ord.factor w/ 3 levels "bad"<"average"<..: 2 2 2 2 2 2 2 3 3 2 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol      quality     rating    
##  Min.   : 8.40   3: 10   bad    :  63  
##  1st Qu.: 9.50   4: 53   average:1319  
##  Median :10.20   5:681   good   : 217  
##  Mean   :10.42   6:638                 
##  3rd Qu.:11.10   7:199                 
##  Max.   :14.90   8: 18

The high concentration of wines in the middle quality scores and the lack of outliers might be a problem for building a predictive model later on.

There is a high concentration of wines with fixed.acidity close to 8 (the median is 7.9), but there are also some outliers that shift the mean up to 8.32.

The distribution of volatile.acidity appears bimodal, with peaks at 0.4 and 0.6 and some outliers in the higher ranges.

Now this is a strange distribution. 8% of wines do not contain citric acid at all. Maybe a problem in the data collection process?

For residual.sugar there is a high concentration of wines around 2.2 (the median) with some outliers along the higher ranges.

We see a similar distribution with chlorides.

The distribution of free.sulfur.dioxide peaks at around 7 and from then on resembles a long-tailed distribution with very few wines over 60.

As expected, the distribution of total.sulfur.dioxide closely resembles the last one.

The distribution for density has a very normal appearance.

pH also looks normally distributed.

For sulphates we see a distribution similar to the ones of residual.sugar and chlorides.

For alcohol we see the same rapid increase followed by a long tail that we saw for sulfur dioxide. I wonder if there is a correlation between the variables.

Univariate Analysis

What is the structure of your dataset?

There are 1599 observations of wines in the dataset with 12 features. There is one categorical variable (quality) and the others are numerical variables that indicate physical and chemical properties of the wine.

Other observations: The median quality is 6, which on the given scale (0-10) is a mediocre wine. The best wine in the sample has a score of 8, and the worst has a score of 3. The dataset is not balanced, that is, there are many more average wines than poor or excellent ones, and this might prove challenging when designing a predictive algorithm.
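To put a number on the imbalance, here is a small sketch that uses only the counts printed by summary() above. The variable names are illustrative, and the "baseline" is simply a model that always predicts the most common rating.

```r
# Counts copied from the summary() output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
rating_counts  <- c(bad = 63, average = 1319, good = 217)

# Share of each class among the 1599 wines.
quality_share <- quality_counts / sum(quality_counts)
rating_share  <- rating_counts / sum(rating_counts)

print(round(quality_share, 3))
print(round(rating_share, 3))

# Accuracy of a trivial model that always predicts the most common rating.
baseline_accuracy <- max(rating_share)
print(round(baseline_accuracy, 3))
```

Any classifier built on the rating variable would need to beat this baseline to be useful, which is one way to read the concern about the unbalanced sample.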

What is/are the main feature(s) of interest in your dataset?

The main feature in the data is quality. I’d like to find out which features determine the quality of wines.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The variables related to acidity (fixed.acidity, volatile.acidity, citric.acid and pH) might explain some of the variance. I suspect the different acid concentrations might alter the taste of the wine. Also, residual.sugar dictates how sweet a wine is and might also have an influence on taste.

Did you create any new variables from existing variables in the dataset?

I created a rating variable to improve the later visualizations.
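The code that built rating is not shown here. As a hedged reconstruction, the counts in the summary (63 bad, 1319 average, 217 good) are consistent with cutting quality below 5, at 5-6, and at 7 or above. The function name make_rating and the cut points are assumptions inferred from those counts, not the original code.

```r
# Hypothetical reconstruction: the cut points (<5, 5-6, >=7) are inferred
# from the counts in the summary output, not taken from the original code.
make_rating <- function(quality_score) {
  cut(quality_score,
      breaks = c(-Inf, 4, 6, Inf),
      labels = c("bad", "average", "good"),
      ordered_result = TRUE)
}

# Rebuild a quality column from the counts shown in the summary.
quality_score <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))
rating <- make_rating(quality_score)

print(table(rating))
```

The resulting table can be compared with the rating column of the summary output to check that the assumed cut points reproduce the same three groups.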

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Citric.acid stood out from the other distributions. Apart from some outliers, it had a roughly rectangular distribution, which seems very unexpected given the wine quality distribution.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000

Alcohol has negative correlation with density. This is expected as alcohol is less dense than water.

Volatile.acidity has a positive correlation with pH (0.23). This is unexpected: pH falls as acidity rises, so we would expect a negative correlation. Maybe the effect of a lurking variable?

Residual.sugar does not show correlation with quality. Free.sulfur.dioxide and total.sulfur.dioxide are highly correlated as expected.

Density has a strong correlation with fixed.acidity (0.67). The variables that have the strongest correlations to quality are alcohol (0.48) and volatile.acidity (-0.39).
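The ranking behind that statement can be reproduced from the quality column of the correlation table. This is a minimal sketch using the printed values only; the vector name cor_with_quality is illustrative.

```r
# Correlations with quality, copied from the last column of the table above.
cor_with_quality <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Sort by absolute value so strong negative correlations rank alongside
# strong positive ones.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting on the absolute value matters here, because volatile.acidity would otherwise end up at the bottom of the list despite being the second strongest relationship.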

As the correlation table showed, fixed.acidity seems to have little to no effect on quality.

Volatile.acidity seems to be an unwanted feature in wines. Quality seems to go up when volatile.acidity goes down. The higher ranges seem to produce more average and poor wines.

We can see the weak correlation between citric.acid and quality (0.23). Better wines tend to have a higher concentration of citric acid.

Contrary to what I initially expected, residual.sugar seems to have little to no effect on perceived quality.

Although the correlation is weak, a lower concentration of chlorides seems to produce better wines.

The ranges of free.sulfur.dioxide are really close to each other, but it seems that too little sulfur dioxide gives a poor wine and too much gives an average wine.

Since total.sulfur.dioxide is a superset of free.sulfur.dioxide, it is no surprise to find a very similar distribution here.

Better wines tend to have lower densities, but this is probably due to the alcohol concentration. I wonder if density still has an effect if we hold alcohol constant.

Although there is definitely a trend in pH (better wines being more acidic), there are some outliers. I wonder how the distribution of the different acids affects this.

It is really strange that an acid concentration would have a positive correlation with pH. Maybe Simpson's paradox?

Although it is not clear what each cluster means, it seems Simpson's paradox is in fact present.
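For readers unfamiliar with the effect, here is a toy illustration of Simpson's paradox. The data are made up for the example and have nothing to do with the wine dataset: each cluster has a perfectly negative trend, yet the pooled correlation is positive.

```r
# Toy data (not from the wine dataset): two clusters, each with a perfectly
# negative trend, but the second cluster sits higher on both axes.
toy <- data.frame(
  cluster = rep(c("A", "B"), each = 5),
  x = 1:10,
  y = c(5:1, 10:6)
)

# Correlation when the clusters are pooled together.
overall <- cor(toy$x, toy$y)

# Correlation computed separately inside each cluster.
within <- sapply(split(toy, toy$cluster), function(d) cor(d$x, d$y))

print(overall)  # positive
print(within)   # negative in both clusters
```

The same mechanism could produce a positive pooled correlation between volatile.acidity and pH even if the relationship inside each cluster of wines points the other way.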

Because we know pH measures acid concentration on a log scale, it is no surprise to find stronger correlations between pH and the log of the acid concentrations. We can investigate how much of the variance in pH these three acidity variables can explain using a linear model.
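The log-scale argument can be checked on simulated data. The sketch below assumes an idealized relation pH = 4 - log10(concentration) plus a little noise; the constants, the noise level, and the variable names are all illustrative and not estimated from the wine data.

```r
set.seed(42)

# Simulated values, not the wine data: concentration in arbitrary units.
concentration <- runif(500, min = 0.1, max = 10)

# Idealized log-scale relation with a small amount of measurement noise.
ph <- 4 - log10(concentration) + rnorm(500, sd = 0.05)

# Pearson correlation measures linear association, so it is stronger once
# the predictor is put on the scale where the relation really is linear.
cor_raw <- cor(ph, concentration)
cor_log <- cor(ph, log10(concentration))
print(c(raw = cor_raw, log10 = cor_log))

# A linear model on the log scale recovers a slope close to the assumed -1.
fit <- lm(ph ~ log10(concentration))
print(coef(fit))
```

In the real data several acids and other compounds contribute at once, so a fit on the wine variables should not be expected to be this clean.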

## 
## Call:
## lm(formula = pH ~ I(log10(citric.acid)) + I(log10(volatile.acidity)) + 
##     I(log10(fixed.acidity)), data = subset(wine, citric.acid > 
##     0))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47184 -0.06318 -0.00003  0.06447  0.32265 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 4.230862   0.040578 104.266  < 2e-16 ***
## I(log10(citric.acid))      -0.052187   0.008797  -5.933 3.72e-09 ***
## I(log10(volatile.acidity)) -0.049788   0.021248  -2.343   0.0193 *  
## I(log10(fixed.acidity))    -1.071983   0.038987 -27.496  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1068 on 1463 degrees of freedom
## Multiple R-squared:  0.4876, Adjusted R-squared:  0.4866 
## F-statistic: 464.1 on 3 and 1463 DF,  p-value: < 2.2e-16
## Warning in loop_apply(n, do.ply): Removed 132 rows containing non-finite
## values (stat_boxplot).

It seems the three acidity variables can only explain about half the variance in pH (R-squared of 0.49). The mean error is especially bad on poor and on excellent wines. This leads me to believe that there are other components that affect acidity.

Interesting. Although there are many outliers in the medium wines, better wines seem to have a higher concentration of sulphates.

The correlation is clear here. As alcohol content increases, we see a higher proportion of better-graded wines. Given the high number of outliers, it seems we cannot rely on alcohol alone to produce better wines. Let’s try using a simple linear model to investigate.

## 
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8442 -0.4112 -0.1690  0.5166  2.5888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.12503    0.17471  -0.716    0.474    
## alcohol      0.36084    0.01668  21.639   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared:  0.2267, Adjusted R-squared:  0.2263 
## F-statistic: 468.3 on 1 and 1597 DF,  p-value: < 2.2e-16

Based on the R-squared value (0.2267), alcohol alone explains only about 23% of the variance in quality. We’re going to need to look at the other variables to generate a better model.
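One detail about the model call above: quality is stored as an ordered factor with levels "3" to "8", and as.numeric() on a factor returns the level codes rather than the labels. The sketch below shows the difference on a few example values. Going through as.character() is one common way to recover the scores, offered here as a suggestion rather than as what the original analysis did.

```r
# A few example scores stored the same way as the quality column:
# an ordered factor with levels "3" < "4" < ... < "8".
quality <- factor(c(3, 5, 6, 8), levels = 3:8, ordered = TRUE)

# as.numeric() on a factor returns the internal level codes.
codes <- as.numeric(quality)
print(codes)   # 1 3 4 6

# Converting through character recovers the actual scores.
scores <- as.numeric(as.character(quality))
print(scores)  # 3 5 6 8
```

Because the codes differ from the scores by a constant 2, the slope and the R-squared of the regression are unaffected; only the intercept would shift by 2 if the actual scores were used.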

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Fixed.acidity seems to have little to no effect on quality.

Quality seems to go up when volatile.acidity goes down. The higher ranges seem to produce more average and poor wines.

Better wines tend to have higher concentration of citric acid.

Contrary to what I initially expected, residual.sugar seems to have little to no effect on perceived quality.

Although the correlation is weak, a lower concentration of chlorides seems to produce better wines.

Better wines tend to have lower densities.

In terms of pH, it seems better wines are more acidic, but there were many outliers. Better wines also seem to have a higher concentration of sulphates.

Alcohol content has the strongest correlation with quality (0.48), but as the linear model showed us, it cannot explain all the variance alone. We’re going to need to look at the other variables to generate a better model.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I verified the strong relation between free and total sulfur.dioxide.

I also checked the relation between the acid concentrations and pH. Of those, only volatile.acidity surprised me, with a positive correlation with pH (0.23), although its coefficient in the log-scale linear model was negative.

What was the strongest relationship you found?

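The relationship between the variables total.sulfur.dioxide and free.sulfur.dioxide (0.67) was among the strongest. In absolute value, the correlation between fixed.acidity and pH (-0.68) is slightly stronger.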

Multivariate Plots Section

Alcohol and other variables

When we hold alcohol constant, there is no evidence that density affects quality, which confirms our earlier suspicion.
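A rough numerical counterpart to this plot is the first-order partial correlation between density and quality, controlling for alcohol. The sketch below uses the standard partial-correlation formula with the three pairwise values from the correlation table above; the variable names are illustrative, and the result inherits whatever numeric coding of quality that table used.

```r
# Pairwise correlations copied from the correlation table above.
r_dq <- -0.17491923  # density vs quality
r_da <- -0.49617977  # density vs alcohol
r_qa <-  0.47616632  # quality vs alcohol

# First-order partial correlation of density and quality given alcohol:
# (r_dq - r_da * r_qa) / sqrt((1 - r_da^2) * (1 - r_qa^2))
partial <- (r_dq - r_da * r_qa) / sqrt((1 - r_da^2) * (1 - r_qa^2))
print(round(partial, 3))
```

The value comes out at roughly 0.08, much closer to zero than the raw -0.17, which is in line with the reading that the apparent effect of density mostly reflects alcohol content.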

## Warning in loop_apply(n, do.ply): Removed 8 rows containing missing values
## (geom_point).

## Warning in loop_apply(n, do.ply): Removed 1 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 7 rows containing missing values
## (geom_point).

Interesting! It seems that for wines with high alcohol content, having a higher concentration of sulphates produces better wines.

The reverse seems to be true for volatile acidity. Less acetic acid combined with a higher alcohol concentration seems to produce better wines.

Low pH and high alcohol concentration seem to be a good match.

Acid exploration

Almost no variance in the y axis compared to the x axis. Let's try the other acids.

High citric acid and low acetic acid seems like a good combination.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$citric.acid and wine$fixed.acidity
## t = 36.2341, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034

Although there seems to be a correlation (0.67) between tartaric acid (fixed.acidity) and citric acid concentrations, nothing stands out in terms of quality.

Linear model

## 
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = training_data)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = training_data)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity, 
##     data = training_data)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity + 
##     citric.acid, data = training_data)
## m5: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity + 
##     citric.acid + fixed.acidity, data = training_data)
## m6: lm(formula = as.numeric(quality) ~ alcohol + sulphates + pH, 
##     data = training_data)
## 
## =============================================================================
##                      m1        m2        m3        m4        m5        m6    
## -----------------------------------------------------------------------------
## (Intercept)       -0.066    -0.604**   0.605*    0.670**   0.294     1.328*  
##                   (0.220)   (0.224)   (0.248)   (0.257)   (0.289)   (0.516)  
## alcohol            0.357***  0.339***  0.306***  0.305***  0.315***  0.362***
##                   (0.021)   (0.020)   (0.020)   (0.020)   (0.020)   (0.021)  
## sulphates                    1.099***  0.745***  0.770***  0.780***  0.980***
##                             (0.138)   (0.137)   (0.139)   (0.138)   (0.139)  
## volatile.acidity                      -1.199*** -1.272*** -1.333***          
##                                       (0.125)   (0.146)   (0.147)            
## citric.acid                                     -0.128    -0.436*            
##                                                 (0.130)   (0.170)            
## fixed.acidity                                              0.047**           
##                                                           (0.017)            
## pH                                                                  -0.631***
##                                                                     (0.152)  
## -----------------------------------------------------------------------------
## R-squared             0.232    0.280     0.343     0.344     0.349     0.293 
## adj. R-squared        0.231    0.279     0.341     0.341     0.346     0.291 
## sigma                 0.704    0.682     0.651     0.651     0.649     0.676 
## F                   289.048  185.949   166.182   124.873   102.212   131.779 
## p                     0.000    0.000     0.000     0.000     0.000     0.000 
## Log-likelihood    -1022.548 -991.540  -947.687  -947.203  -943.227  -983.004 
## Deviance            473.685  444.023   405.216   404.808   401.465   436.188 
## AIC                2051.096 1991.080  1905.374  1906.407  1900.454  1976.008 
## BIC                2065.693 2010.544  1929.704  1935.602  1934.516  2000.337 
## N                   959      959       959       959       959       959     
## =============================================================================

I did not include pH in the same formula with the acids to avoid collinearity problems.
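The comparison table above can also be read through its information criteria. This sketch copies the adjusted R-squared, AIC and BIC rows into a data frame and picks the model that minimizes each criterion; the data frame and column names are illustrative.

```r
# Fit statistics copied from the model comparison table above.
models <- data.frame(
  model  = paste0("m", 1:6),
  adj_r2 = c(0.231, 0.279, 0.341, 0.341, 0.346, 0.291),
  aic    = c(2051.096, 1991.080, 1905.374, 1906.407, 1900.454, 1976.008),
  bic    = c(2065.693, 2010.544, 1929.704, 1935.602, 1934.516, 2000.337),
  stringsAsFactors = FALSE
)

# Lower AIC/BIC is better; BIC penalizes extra terms more heavily.
best_aic <- models$model[which.min(models$aic)]
best_bic <- models$model[which.min(models$bic)]

print(models[order(models$aic), ])
print(c(best_aic = best_aic, best_bic = best_bic))
```

AIC favors m5 while BIC favors the smaller m3, which suggests that citric.acid and fixed.acidity add little beyond alcohol, sulphates and volatile.acidity.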

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

High alcohol contents and high sulphate concentrations combined seem to produce better wines.

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created several models. The most prominent of them was composed of the variables alcohol, sulphates, and the acid variables. There are two problems with it. First, the low R-squared score suggests that there is missing information needed to properly predict quality. Second, both the residuals plot and the cross-validation favor average wines. This is probably a reflection of the high number of average wines in the training dataset, or it could mean that there is missing information that would help predict the edge cases. I hope that the next course in the nanodegree will help me generate better models :) .


Final Plots and Summary

Plot One

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

Description One

This is a very strange distribution. It does not match what we would expect from a variable collected in an experimental situation.

Plot Two

## Warning in loop_apply(n, do.ply): Removed 8 rows containing missing values
## (geom_point).

Description Two

High alcohol contents and high sulphate concentrations combined seem to produce better wines.

Plot Three

Description Three

The linear model with the highest R-squared value could only explain around 35% of the variance in quality. Also, the clear correlation shown by the residual plot earlier seems to reinforce that there is missing information needed to better predict both poor and excellent wines.